{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LangSmith Custom LLM Evaluation\n",
    "\n",
    "- Author: [HeeWung Song(Dan)](https://github.com/kofsitho87)\n",
    "- Design: \n",
    "- Peer Review: \n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/06-MarkdownHeaderTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/06-MarkdownHeaderTextSplitter.ipynb)\n",
    "\n",
    "## Overview\n",
    "\n",
    "**LangSmith Custom LLM Evaluation** is a customizable evaluation framework in LangChain that enables users to assess LLM application outputs based on their specific requirements.\n",
    "\n",
    "1. **Custom Evaluation Logic**: \n",
    "   - Define your own evaluation criteria\n",
    "   - Create specific scoring mechanisms\n",
    "\n",
    "2. **Easy Integration**:\n",
    "   - Works with LangChain's RAG systems\n",
    "   - Compatible with LangSmith for evaluation tracking\n",
    "\n",
    "3. **Evaluation Methods**:\n",
    "   - Simple metric-based evaluation\n",
    "   - Advanced LLM-based assessment\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [RAG System Setup](#rag-system-setup)\n",
    "- [Basic Custom Evaluator](#basic-custom-evaluator)\n",
    "- [Custom LLM-as-Judge](#custom-llm-as-judge)\n",
    "\n",
    "\n",
    "### References\n",
    "\n",
    "- [LangChain Get started with LangSmith](https://docs.smith.langchain.com/)\n",
    "- [LangChain How to define a custom evaluator](https://docs.smith.langchain.com/evaluation/how_to_guides/custom_evaluator)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.\n",
    "\n",
    "**[Note]**\n",
    "- The `langchain-opentutorial` is a package of easy-to-use environment setup guidance, useful functions and utilities for tutorials.\n",
    "- Check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langsmith\",\n",
    "        \"langchain\",\n",
    "        \"langchain_core\",\n",
    "        \"langchain_community\",\n",
    "        \"langchain_openai\",\n",
    "        \"pymupdf\",\n",
    "        \"faiss-cpu\",\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {\n",
    "        \"OPENAI_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "        \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "        \"LANGCHAIN_PROJECT\": \"LangSmith-Custom-LLM-Evaluation\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can set and load `OPENAI_API_KEY` from a `.env` file.\n",
    "\n",
    "**[Note]** This is only necessary if you haven't already set `OPENAI_API_KEY` in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## RAG System Setup\n",
    "\n",
    "We will build a basic **RAG** (Retrieval-Augmented Generation) system to test **Custom Evaluators**. This implementation creates a question-answering system based on PDF documents, which will serve as our foundation for evaluation purposes.\n",
    "\n",
    "This **RAG** system will be used to evaluate answer quality and accuracy through **Custom Evaluators** in later sections.\n",
    "\n",
    "### RAG System Preparation\n",
    "\n",
    "1. **Document Processing**\n",
    "   - `load_documents()`: Loads PDF documents using `PyMuPDFLoader`\n",
    "   - `split_documents()`: Splits documents into appropriate sizes using `RecursiveCharacterTextSplitter`\n",
    "\n",
    "2. **Vector Store Creation**\n",
    "   - `create_vectorstore()`: Creates vector DB using `OpenAIEmbeddings` and `FAISS`\n",
    "   - `create_retriever()`: Generates a retriever based on the vector store\n",
    "\n",
    "3. **QA Chain Configuration**\n",
    "   - `create_chain()`: Creates a chain that answers questions based on retrieved context\n",
    "   - Includes prompt template for question-answering tasks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from myrag import PDFRAG\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "# Create PDFRAG object\n",
    "rag = PDFRAG(\n",
    "    \"data/Newwhitepaper_Agents2.pdf\",\n",
    "    ChatOpenAI(model=\"gpt-4o-mini\", temperature=0),\n",
    ")\n",
    "\n",
    "# Create Retriever\n",
    "retriever = rag.create_retriever()\n",
    "\n",
    "# Create Chain\n",
    "chain = rag.create_chain(retriever)\n",
    "\n",
    "# Generate answer for question\n",
    "chain.invoke(\"List up the name of the authors\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll create a function called `ask_question` that takes a dictionary `inputs` as a parameter and returns a dictionary with an `answer` key. This function will serve as our question-answering interface."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create function to answer question\n",
    "def ask_question(inputs: dict):\n",
    "    return {\"answer\": chain.invoke(inputs[\"question\"])}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Custom Evaluator\n",
    "\n",
    "Let's explore the fundamental concepts of creating **Custom Evaluators**. **Custom Evaluators** are evaluation tools in LangChain's LangSmith evaluation system that users can define according to their specific requirements. LangSmith provides a comprehensive platform for monitoring, evaluating, and improving LLM applications.\n",
    "\n",
    "### Understanding Evaluator Arguments\n",
    "\n",
    "Custom Evaluator functions can use the following arguments:\n",
    "\n",
    "- `run (Run)`: The complete Run object generated by the application\n",
    "- `example (Example)`: Dataset example containing inputs, outputs, and metadata\n",
    "- `inputs (dict)`: Input dictionary for a single example from the dataset\n",
    "- `outputs (dict)`: Output dictionary generated by the application for given inputs\n",
    "- `reference_outputs (dict)`: Reference output dictionary associated with the example\n",
    "\n",
    "In most cases, `inputs`, `outputs`, and `reference_outputs` are sufficient. The `run` and `example` objects are only needed when additional metadata is required.\n",
    "\n",
    "### Understanding Output Types\n",
    "\n",
    "**Custom Evaluators** can return results in the following formats:\n",
    "\n",
    "1. **Dictionary Format** (Recommended)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'{\"key\": \"metric_name\", \"score\": value}'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"{\\\"key\\\": \\\"metric_name\\\", \\\"score\\\": value}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. **Basic Types** (Python)\n",
    "   - `int`, `float`, `bool`: Continuous numerical metrics\n",
    "   - `str`: Categorical metrics\n",
    "   \n",
    "3. **Multiple Metrics**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'[{\"key\": \"metric1\", \"score\": value1}, {\"key\": \"metric2\", \"score\": value2}]'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"[{\\\"key\\\": \\\"metric1\\\", \\\"score\\\": value1}, {\\\"key\\\": \\\"metric2\\\", \\\"score\\\": value2}]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Score Evaluator Example\n",
    "\n",
    "Now, let's create a simple **Custom Evaluator** example. This evaluator will return a random score between 1 and 10, regardless of the answer content.\n",
    "\n",
    "**Random Score Evaluator Implementation**\n",
    "- Takes `Run` and `Example` objects as input parameters\n",
    "- Returns a dictionary in the format: `{\\\"key\\\": \\\"random_score\\\", \\\"score\\\": score}`\n",
    "\n",
    "Here's the basic implementation of a random score evaluator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith.schemas import Run, Example\n",
    "import random\n",
    "\n",
    "\n",
    "def random_score_evaluator(run: Run, example: Example) -> dict:\n",
    "    # Return random score\n",
    "    score = random.randint(1, 10)\n",
    "    return {\"key\": \"random_score\", \"score\": score}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for experiment: 'CUSTOM-EVAL-565330e1' at:\n",
      "https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=d0296986-a186-4dc6-a327-659c1e00169c\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2161895e1dac47658d1fca9155f0e5ce",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "# Set dataset name\n",
    "dataset_name = \"RAG_EVAL_DATASET\"\n",
    "\n",
    "# Run\n",
    "experiment_results = evaluate(\n",
    "    ask_question,\n",
    "    data=dataset_name,\n",
    "    evaluators=[random_score_evaluator],\n",
    "    experiment_prefix=\"CUSTOM-EVAL\",\n",
    "    # Set experiment metadata\n",
    "    metadata={\n",
    "        \"variant\": \"Random Score Evaluator\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>inputs.question</th>\n",
       "      <th>outputs.answer</th>\n",
       "      <th>error</th>\n",
       "      <th>reference.answer</th>\n",
       "      <th>feedback.random_score</th>\n",
       "      <th>execution_time</th>\n",
       "      <th>example_id</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What are the three targeted learnings to enhan...</td>\n",
       "      <td>The three targeted learnings to enhance model ...</td>\n",
       "      <td>None</td>\n",
       "      <td>The three targeted learning approaches to enha...</td>\n",
       "      <td>4</td>\n",
       "      <td>3.112384</td>\n",
       "      <td>0e661de4-636b-425d-8f6e-0a52b8070576</td>\n",
       "      <td>ae36f6a7-86a2-4f0a-89d2-8be9671ca3cb</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What are the key functions of an agent's orche...</td>\n",
       "      <td>The key functions of an agent's orchestration ...</td>\n",
       "      <td>None</td>\n",
       "      <td>The key functions of an agent's orchestration ...</td>\n",
       "      <td>6</td>\n",
       "      <td>4.077394</td>\n",
       "      <td>3561c6fe-6ed4-4182-989a-270dcd635f32</td>\n",
       "      <td>6c65f286-a103-4a60-b906-555fd405ea7e</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>List up the name of the authors</td>\n",
       "      <td>The authors are Julia Wiesinger, Patrick Marlo...</td>\n",
       "      <td>None</td>\n",
       "      <td>The authors are Julia Wiesinger, Patrick Marlo...</td>\n",
       "      <td>7</td>\n",
       "      <td>1.172011</td>\n",
       "      <td>b03e98d1-44ad-4142-8dfa-7b0a31a57096</td>\n",
       "      <td>429dad1e-f68c-4f67-ae36-cc2171c4c6a0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>What is Tree-of-thoughts?</td>\n",
       "      <td>Tree-of-thoughts (ToT) is a prompt engineering...</td>\n",
       "      <td>None</td>\n",
       "      <td>Tree-of-thoughts (ToT) is a prompt engineering...</td>\n",
       "      <td>5</td>\n",
       "      <td>1.374912</td>\n",
       "      <td>be18ec98-ab18-4f30-9205-e75f1cb70844</td>\n",
       "      <td>be337bef-90b0-4b6a-b9ab-941562ab4b44</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>What is the framework used for reasoning and p...</td>\n",
       "      <td>The frameworks used for reasoning and planning...</td>\n",
       "      <td>None</td>\n",
       "      <td>The frameworks used for reasoning and planning...</td>\n",
       "      <td>7</td>\n",
       "      <td>1.821961</td>\n",
       "      <td>eb4b29a7-511c-4f78-a08f-2d5afeb84320</td>\n",
       "      <td>9cff92b1-04e7-49f5-ab2a-85763468e6cb</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>How do agents differ from standalone language ...</td>\n",
       "      <td>Agents differ from standalone language models ...</td>\n",
       "      <td>None</td>\n",
       "      <td>Agents can use tools to access real-time data ...</td>\n",
       "      <td>1</td>\n",
       "      <td>2.135424</td>\n",
       "      <td>f4a5a0cf-2d2e-4e15-838a-bc8296eb708b</td>\n",
       "      <td>3fbe6fa6-88bf-46de-bdfa-0f39eac18c78</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     inputs.question  \\\n",
       "0  What are the three targeted learnings to enhan...   \n",
       "1  What are the key functions of an agent's orche...   \n",
       "2                    List up the name of the authors   \n",
       "3                          What is Tree-of-thoughts?   \n",
       "4  What is the framework used for reasoning and p...   \n",
       "5  How do agents differ from standalone language ...   \n",
       "\n",
       "                                      outputs.answer error  \\\n",
       "0  The three targeted learnings to enhance model ...  None   \n",
       "1  The key functions of an agent's orchestration ...  None   \n",
       "2  The authors are Julia Wiesinger, Patrick Marlo...  None   \n",
       "3  Tree-of-thoughts (ToT) is a prompt engineering...  None   \n",
       "4  The frameworks used for reasoning and planning...  None   \n",
       "5  Agents differ from standalone language models ...  None   \n",
       "\n",
       "                                    reference.answer  feedback.random_score  \\\n",
       "0  The three targeted learning approaches to enha...                      4   \n",
       "1  The key functions of an agent's orchestration ...                      6   \n",
       "2  The authors are Julia Wiesinger, Patrick Marlo...                      7   \n",
       "3  Tree-of-thoughts (ToT) is a prompt engineering...                      5   \n",
       "4  The frameworks used for reasoning and planning...                      7   \n",
       "5  Agents can use tools to access real-time data ...                      1   \n",
       "\n",
       "   execution_time                            example_id  \\\n",
       "0        3.112384  0e661de4-636b-425d-8f6e-0a52b8070576   \n",
       "1        4.077394  3561c6fe-6ed4-4182-989a-270dcd635f32   \n",
       "2        1.172011  b03e98d1-44ad-4142-8dfa-7b0a31a57096   \n",
       "3        1.374912  be18ec98-ab18-4f30-9205-e75f1cb70844   \n",
       "4        1.821961  eb4b29a7-511c-4f78-a08f-2d5afeb84320   \n",
       "5        2.135424  f4a5a0cf-2d2e-4e15-838a-bc8296eb708b   \n",
       "\n",
       "                                     id  \n",
       "0  ae36f6a7-86a2-4f0a-89d2-8be9671ca3cb  \n",
       "1  6c65f286-a103-4a60-b906-555fd405ea7e  \n",
       "2  429dad1e-f68c-4f67-ae36-cc2171c4c6a0  \n",
       "3  be337bef-90b0-4b6a-b9ab-941562ab4b44  \n",
       "4  9cff92b1-04e7-49f5-ab2a-85763468e6cb  \n",
       "5  3fbe6fa6-88bf-46de-bdfa-0f39eac18c78  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment_results.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![CUSTOM-EVAL-FOR-RANDOM-SCORE](./assets/07-LangSmith-Custom-LLM-Evaluation-01.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom LLM-as-Judge\n",
    "\n",
    "Now, we'll create a LLM Chain to use as an evaluator. \n",
    "\n",
    "First, let's define a function that returns `context`, `answer`, and `question`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to return RAG results with `context`, `answer`, and `question`\n",
    "def context_answer_rag_answer(inputs: dict):\n",
    "    # Get context from Vector Store Retriever\n",
    "    context = retriever.invoke(inputs[\"question\"])\n",
    "    # Get answer from RAG Chain in PDFRAG\n",
    "    answer = chain.invoke(inputs[\"question\"])\n",
    "    return {\n",
    "        \"context\": \"\\n\".join([doc.page_content for doc in context]),\n",
    "        \"answer\": answer,\n",
    "        \"question\": inputs[\"question\"],\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run our evaluation using LangSmith's evaluate function. We'll use our custom evaluator to assess the RAG system's performance across our test dataset.\n",
    "\n",
    "We'll use the `teddynote/context-answer-evaluator` prompt template from LangChain Hub, which provides a structured evaluation framework for RAG systems.\n",
    "\n",
    "The evaluator uses the following criteria:\n",
    "- **Accuracy (0-10)**: How well the answer aligns with the context\n",
    "- **Comprehensiveness (0-10)**: How complete and detailed the answer is\n",
    "- **Context Precision (0-10)**: How effectively the context information is used\n",
    "\n",
    "The final score is normalized to a 0-1 scale using the formula:\n",
    "`Final Score = (Accuracy + Comprehensiveness + Context Precision) / 30`\n",
    "\n",
    "This evaluation framework helps us quantitatively assess the quality of our RAG system's responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "As an LLM evaluator (judge), please assess the LLM's response to the given question. Evaluate the response's accuracy, comprehensiveness, and context precision based on the provided context. After your evaluation, return only the numerical scores in the following format:\n",
      "Accuracy: [score]\n",
      "Comprehensiveness: [score]\n",
      "Context Precision: [score]\n",
      "Final: [normalized score]\n",
      "Grading rubric:\n",
      "\n",
      "Accuracy (0-10 points):\n",
      "Evaluate how well the answer aligns with the information provided in the given context.\n",
      "\n",
      "0 points: The answer is completely inaccurate or contradicts the provided context\n",
      "4 points: The answer partially aligns with the context but contains significant inaccuracies\n",
      "7 points: The answer mostly aligns with the context but has minor inaccuracies or omissions\n",
      "10 points: The answer fully aligns with the provided context and is completely accurate\n",
      "\n",
      "\n",
      "Comprehensiveness (0-10 points):\n",
      "\n",
      "0 points: The answer is completely inadequate or irrelevant\n",
      "3 points: The answer is accurate but too brief to fully address the question\n",
      "7 points: The answer covers main aspects but lacks detail or misses minor points\n",
      "10 points: The answer comprehensively covers all aspects of the question\n",
      "\n",
      "\n",
      "Context Precision (0-10 points):\n",
      "Evaluate how precisely the answer uses the information from the provided context.\n",
      "\n",
      "0 points: The answer doesn't use any information from the context or uses it entirely incorrectly\n",
      "4 points: The answer uses some information from the context but with significant misinterpretations\n",
      "7 points: The answer uses most of the relevant context information correctly but with minor misinterpretations\n",
      "10 points: The answer precisely and correctly uses all relevant information from the context\n",
      "\n",
      "\n",
      "Final Normalized Score:\n",
      "Calculate by summing the scores for accuracy, comprehensiveness, and context precision, then dividing by 30 to get a score between 0 and 1.\n",
      "Formula: (Accuracy + Comprehensiveness + Context Precision) / 30\n",
      "\n",
      "#Given question:\n",
      "\u001b[33;1m\u001b[1;3m{question}\u001b[0m\n",
      "\n",
      "#LLM's response:\n",
      "\u001b[33;1m\u001b[1;3m{answer}\u001b[0m\n",
      "\n",
      "#Provided context:\n",
      "\u001b[33;1m\u001b[1;3m{context}\u001b[0m\n",
      "\n",
      "Please evaluate the LLM's response according to the criteria above. \n",
      "\n",
      "In your output, include only the numerical scores for FINAL NORMALIZED SCORE without any additional explanation or reasoning.\n",
      "ex) 0.81\n",
      "\n",
      "#Final Normalized Score(Just the number):\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain import hub\n",
    "\n",
    "# Get evaluator Prompt\n",
    "llm_evaluator_prompt = hub.pull(\"teddynote/context-answer-evaluator\")\n",
    "llm_evaluator_prompt.pretty_print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "# Create evaluator\n",
    "custom_llm_evaluator = (\n",
    "    llm_evaluator_prompt\n",
    "    | ChatOpenAI(temperature=0.0, model=\"gpt-4o-mini\")\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's evaluate our system using the previously created `context_answer_rag_answer` function. We'll pass the generated answer and context to our `custom_llm_evaluator` for assessment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.87'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate answer\n",
    "output = context_answer_rag_answer(\n",
    "    {\"question\": \"What are the three targeted learnings to enhance model performance?\"}\n",
    ")\n",
    "\n",
    "# Run evaluator\n",
    "custom_llm_evaluator.invoke(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's define our `custom_evaluator` function.\n",
    "\n",
    "- `run.outputs`: Gets the `answer`, `context`, and `question` generated by the RAG chain\n",
    "- `example.outputs`: Gets the reference answer from our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith.schemas import Run, Example\n",
    "\n",
    "def custom_evaluator(run: Run, example: Example) -> dict:\n",
    "    # Get LLM generated answer and reference answer\n",
    "    llm_answer = run.outputs.get(\"answer\", \"\")\n",
    "    context = run.outputs.get(\"context\", \"\")\n",
    "    question = example.outputs.get(\"question\", \"\")\n",
    "\n",
    "    # Return custom score\n",
    "    score = custom_llm_evaluator.invoke(\n",
    "        {\"question\": question, \"answer\": llm_answer, \"context\": context}\n",
    "    )\n",
    "    return {\"key\": \"custom_score\", \"score\": float(score)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run our evaluation using LangSmith's evaluate function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for experiment: 'CUSTOM-LLM-EVAL-e33ee0a7' at:\n",
      "https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=156ad2c4-b8ec-4ada-b76c-44b09a527b50\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5d368ec85b944a71be33143061c0ac5f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "# Set dataset name\n",
    "dataset_name = \"RAG_EVAL_DATASET\"\n",
    "\n",
    "# Run\n",
    "experiment_results = evaluate(\n",
    "    context_answer_rag_answer,\n",
    "    data=dataset_name,\n",
    "    evaluators=[custom_evaluator],\n",
    "    experiment_prefix=\"CUSTOM-LLM-EVAL\",\n",
    "    # Set experiment metadata\n",
    "    metadata={\n",
    "        \"variant\": \"Evaluation using Custom LLM Evaluator\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>inputs.question</th>\n",
       "      <th>outputs.context</th>\n",
       "      <th>outputs.answer</th>\n",
       "      <th>outputs.question</th>\n",
       "      <th>error</th>\n",
       "      <th>reference.answer</th>\n",
       "      <th>feedback.custom_score</th>\n",
       "      <th>execution_time</th>\n",
       "      <th>example_id</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What are the three targeted learnings to enhan...</td>\n",
       "      <td>Agents\\n33\\nSeptember 2024\\nEnhancing model pe...</td>\n",
       "      <td>The three targeted learnings to enhance model ...</td>\n",
       "      <td>What are the three targeted learnings to enhan...</td>\n",
       "      <td>None</td>\n",
       "      <td>The three targeted learning approaches to enha...</td>\n",
       "      <td>0.87</td>\n",
       "      <td>3.603254</td>\n",
       "      <td>0e661de4-636b-425d-8f6e-0a52b8070576</td>\n",
       "      <td>85ddbfcb-8c49-4551-890a-f137d7b413b8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What are the key functions of an agent's orche...</td>\n",
       "      <td>implementation of the agent orchestration laye...</td>\n",
       "      <td>The key functions of an agent's orchestration ...</td>\n",
       "      <td>What are the key functions of an agent's orche...</td>\n",
       "      <td>None</td>\n",
       "      <td>The key functions of an agent's orchestration ...</td>\n",
       "      <td>0.93</td>\n",
       "      <td>4.028933</td>\n",
       "      <td>3561c6fe-6ed4-4182-989a-270dcd635f32</td>\n",
       "      <td>0b423bb6-c722-41af-ae6e-c193ebc3ff8a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>List up the name of the authors</td>\n",
       "      <td>Agents\\nAuthors: Julia Wiesinger, Patrick Marl...</td>\n",
       "      <td>The authors are Julia Wiesinger, Patrick Marlo...</td>\n",
       "      <td>List up the name of the authors</td>\n",
       "      <td>None</td>\n",
       "      <td>The authors are Julia Wiesinger, Patrick Marlo...</td>\n",
       "      <td>0.87</td>\n",
       "      <td>1.885114</td>\n",
       "      <td>b03e98d1-44ad-4142-8dfa-7b0a31a57096</td>\n",
       "      <td>54e0987b-502f-48a7-877f-4b3d56bd82cf</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>What is Tree-of-thoughts?</td>\n",
       "      <td>weaknesses depending on the specific applicati...</td>\n",
       "      <td>Tree-of-thoughts (ToT) is a prompt engineering...</td>\n",
       "      <td>What is Tree-of-thoughts?</td>\n",
       "      <td>None</td>\n",
       "      <td>Tree-of-thoughts (ToT) is a prompt engineering...</td>\n",
       "      <td>0.87</td>\n",
       "      <td>1.732563</td>\n",
       "      <td>be18ec98-ab18-4f30-9205-e75f1cb70844</td>\n",
       "      <td>f0b02411-b377-4eaa-821a-2108b8b4836f</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>What is the framework used for reasoning and p...</td>\n",
       "      <td>reasoning frameworks (CoT, ReAct, etc.) to \\nf...</td>\n",
       "      <td>The frameworks used for reasoning and planning...</td>\n",
       "      <td>What is the framework used for reasoning and p...</td>\n",
       "      <td>None</td>\n",
       "      <td>The frameworks used for reasoning and planning...</td>\n",
       "      <td>0.83</td>\n",
       "      <td>2.651672</td>\n",
       "      <td>eb4b29a7-511c-4f78-a08f-2d5afeb84320</td>\n",
       "      <td>38d34eb6-1ec5-44ea-a7d0-c7c98d46b0bc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>How do agents differ from standalone language ...</td>\n",
       "      <td>1.\\t Agents extend the capabilities of languag...</td>\n",
       "      <td>Agents differ from standalone language models ...</td>\n",
       "      <td>How do agents differ from standalone language ...</td>\n",
       "      <td>None</td>\n",
       "      <td>Agents can use tools to access real-time data ...</td>\n",
       "      <td>0.93</td>\n",
       "      <td>2.519094</td>\n",
       "      <td>f4a5a0cf-2d2e-4e15-838a-bc8296eb708b</td>\n",
       "      <td>49b26b38-e499-4c71-bdcb-eccfa44a1beb</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "<ExperimentResults CUSTOM-LLM-EVAL-e33ee0a7>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment_results.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![CUSTOM-EVAL-FOR-CUSTOM-SCORE](./assets/07-LangSmith-Custom-LLM-Evaluation-02.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
